-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature][GCS FT] Best-effort redis cleanup job #1766
[Feature][GCS FT] Best-effort redis cleanup job #1766
Conversation
@@ -283,6 +284,19 @@ func (r *RayClusterReconciler) rayClusterReconcile(ctx context.Context, request | |||
"We do not need to requeue the RayCluster CR anymore.", instance.Name)) | |||
return ctrl.Result{}, nil | |||
} | |||
// redisCleanupJob.Spec.ActiveDeadlineSeconds is set by the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we merge this if
statement and the if redisCleanupJob.Status.Succeeded > 0
statement?
if utils.IsJobFinished(&redisCleanupJob) {
isBestEffort := redisCleanupJob.Spec.ActiveDeadlineSeconds != nil
if (Job is complete) {
print some logs
} else {
print some logs
maybe file a k8s event
}
remove finalizer
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return false | ||
} | ||
|
||
func EnvInt64(name string, fallback int64) int64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we could directly use strconv.ParseInt(v, 10, 64)
in the buildRedisCleanupJob
function. If users input invalid values, it should fail. In open-source projects, we often need to address numerous user questions without access to their Kubernetes clusters. Therefore, identifying and surfacing issues as early as possible might be a better strategy for maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kevin85421,
I find it hard to make the operator fail in the buildRedisCleanupJob
. This function is called only when the operator is going to create the cleanup job, which is already too late.
Considering that the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS
is an environment variable, the proper timing to fail the operator should be during initialization. How about we parse the env and set the parsed result as a member field of RayClusterReconciler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After our latest discussion, we decided not to introduce the GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS
but always set the .spec.activeDeadlineSeconds
to 300 seconds instead.
9dde125
to
a548d87
Compare
a548d87
to
93aa1a3
Compare
Why are these changes needed?
Following the previous proposal #1557 (comment), this PR introduces a new environment variable
GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS
to the kuberay-operator.This variable will be set to the
.spec.activeDeadlineSeconds
option of the redis cleanup job of a fault-tolerant RayCluster, and it will terminate the job after the deadline no matter what.This PR further makes the kuberay-operator remove the
ray.io/gcs-ft-redis-cleanup-finalizer
finalizer on the RayCluster allowing it to be deleted after the job is terminated.The default value of
GCS_FT_REDIS_CLEANUP_DEADLINE_SECONDS
is 300. This essentially makes all redis cleanup jobs work in a best-effort manner for 5 minutes by default.Related issue number
#1725
#1557
Checks